Model selection

Douwe Molenaar

Systems Biology Lab

2024-11-01

Model selection: Introduction

Model selection

Goals of statistical modeling include:

  • Variable selection: on which predictor variables does a response variable depend?
  • Inference: is the dependence between a predictor and a response variable statistically significant?
  • Prediction: to predict response values from observed predictors as accurately as possible.
    • Predict in a similar setting: modeling dependence is sufficient.
    • Predict in a different setting: mechanistic (causal) knowledge is needed.

Model selection is about determining what a good model is for one or more of these tasks, and how to objectively compare models.

Properties of a (linear) model: terminology

Term                     Symbol           Definition
Degrees of freedom       \(\text{DF}\)    \(n - p\)
Residual sum of squares  \(\text{RSS}\)   \(\sum_{i=1}^n \left( y_i- f(\vec{x}_i,\vec{\beta}) \right)^2\)
Mean square error        \(\text{MSE}\)   \(\frac{1}{n} \text{RSS}\)
Residual standard error  \(\text{RSE}\)   \(\sqrt{\frac{1}{n - p}\text{RSS}}\)

In which:

  • \(n\) is the number of observations (samples)
  • \(p\) is the number of parameters (terms, predictors) used in the model
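These quantities are easy to compute from an `lm()` fit; a minimal sketch with simulated data (all names and values here are illustrative):

```r
# Illustrative check of the definitions above on a small synthetic fit
set.seed(1)
n <- 20
x <- 1:n
y <- 2 + 3 * x + rnorm(n, 0, 1)
fit <- lm(y ~ x)                 # p = 2 parameters: intercept and slope
p <- length(coef(fit))

RSS <- sum(residuals(fit)^2)     # residual sum of squares
MSE <- RSS / n                   # mean square error
RSE <- sqrt(RSS / (n - p))       # residual standard error, DF = n - p

# R reports the same residual standard error in summary(fit)
all.equal(RSE, summary(fit)$sigma)  # TRUE
```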

Fundamental observation

How the RSS depends on the number of predictors

[Figure: RSS versus the number of predictors included in the model, for a data set of 10 samples with 10 random vectors as candidate predictors; only vector nr. 1 is a true predictor, the rest are uninformative random vectors]

Explanation

Why a model with 10 parameters fits a data set with 10 observations perfectly

When the columns of the data matrix \(\boldsymbol{X}\) consist of 10 independent vectors (possibly just noise), then in the matrix equation

\[ \boldsymbol{X}\vec{\beta} = \vec{y} \]

\(\boldsymbol{X}\) is an invertible \(10\times10\) matrix with rank 10, the response \(\vec{y}\) of length 10 lies in its column space, and \(\vec{\beta}\) of length 10 has an exact solution.

This contrasts with the case where the number of rows in \(\boldsymbol{X}\) (observations) is larger than the number of columns (predictors, terms in the model equation).
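This can be demonstrated directly in R: with 10 observations and 10 random predictor columns, `lm()` reproduces even a pure-noise response exactly.

```r
# 10 observations, 10 random predictor columns: the 10x10 matrix X is
# (almost surely) invertible, so beta = X^{-1} y fits y exactly
set.seed(42)
X <- matrix(rnorm(100), nrow = 10, ncol = 10)
y <- rnorm(10)                 # pure noise as "response"
fit <- lm(y ~ X - 1)           # -1: the 10 columns are the only terms
RSS <- sum(residuals(fit)^2)
RSS                            # numerically zero: a "perfect" fit
```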

Conclusions and anticipations

  1. Any model will “predict” the training data to some extent, even if it doesn’t contain informative predictors
  2. More complex models (with more variables and parameters) will usually better “predict” training data
  3. A model that does not describe a true (cor)relation between predictors and response variables is unlikely to predict test data accurately

The third conclusion and the previous graphs give us clues about at least two ways to objectively compare the appropriateness of statistical models:

  • A better model makes better predictions of test data, not merely of training data
    • Compare the performance of models on test data
  • A better model does not fit noise (uninformative predictors)
    • The reduction in RSS by an informative predictor should be much larger than that by an uninformative predictor

Model selection: Comparing models

Increasing model complexity


M1: \(\beta_0 + \beta_1 x\)

M2: \(\beta_0 + \beta_1 x + \beta_2 x^2\)





M3: \(\beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\)

M4: \(\beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4\)

Variance explained and MSE of these models

Model Parameters MSE \(R^2\)
M0 1 1169.2 0.0000
M1 2 74.6 0.9362
M2 3 13.7 0.9883
M3 4 11.0 0.9906
M4 5 10.5 0.9910
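A sketch of how such a table can be produced. The data set below is simulated, so the exact numbers differ from the table above, but the pattern is the same: MSE falls and \(R^2\) rises with every added term.

```r
# Fit polynomial models of increasing degree and tabulate MSE and R^2
# (illustrative data; the table above uses the slides' own data set)
set.seed(1234)
x <- 0:30
y <- 10 + x - 0.02 * x^2 + rnorm(length(x), 0, 4)
n <- length(y)

models <- list(
  M0 = lm(y ~ 1),
  M1 = lm(y ~ x),
  M2 = lm(y ~ x + I(x^2)),
  M3 = lm(y ~ x + I(x^2) + I(x^3)),
  M4 = lm(y ~ x + I(x^2) + I(x^3) + I(x^4))
)

tab <- data.frame(
  Parameters = sapply(models, function(m) length(coef(m))),
  MSE        = sapply(models, function(m) mean(residuals(m)^2)),
  R2         = sapply(models, function(m) summary(m)$r.squared)
)
round(tab, 4)   # MSE decreases, R^2 increases with every added term
```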


This shows that model M4, the model with the most parameters, is the best model, right?

The largest model is not necessarily the best model

  • A very large, complex model is more likely to fit the training data, even if there is no true (cor)relation between predictors and response variable (“over-fitting”)

Even if the model describes underlying true relations

  • A very large model is likely to fit measurement noise in addition to the underlying relation (overfitting)

The largest model is not necessarily the best model



  • Synthetic training data were generated originating from a parabola (dashed line)
  • Noise was added
Code
set.seed(1234)
N <- 30
x <- 0:N                                # predictor values
f <- function(x) x + 10 - 0.02 * x^2    # true relation: a parabola
ytrue <- f(x)
y <- ytrue + rnorm(length(x), 0, 4)     # add normal noise, sd = 4
noisydata <- data.frame(x = x, y = y)

Example: better fit, worse prediction

Fitting on a training set

  • Linear regression: MSE=16
  • Polynomial regression: MSE=9.2

Predicting a validation set

  • Linear model: MSE=12
  • Polynomial model: MSE=18
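The comparison above can be sketched as follows. The exact train/validation split used in the slides is not shown, so the random 20/11 split and the degree-8 polynomial below are assumptions for illustration.

```r
# Better fit on training data need not mean better prediction
set.seed(1234)
x <- 0:30
y <- (x + 10 - 0.02 * x^2) + rnorm(length(x), 0, 4)
idx   <- sample(seq_along(x), size = 20)          # assumed 20/11 split
train <- data.frame(x = x[idx],  y = y[idx])
valid <- data.frame(x = x[-idx], y = y[-idx])

m_lin  <- lm(y ~ x, data = train)
m_poly <- lm(y ~ poly(x, 8), data = train)        # deliberately too flexible

mse <- function(m, d) mean((d$y - predict(m, newdata = d))^2)
c(train_lin = mse(m_lin, train), train_poly = mse(m_poly, train))
c(valid_lin = mse(m_lin, valid), valid_poly = mse(m_poly, valid))
```

On the training set the flexible polynomial always wins (it nests the linear model); on the validation set it often loses.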

What is a better model?

One that has a higher “variance explained”?

  • A model with more parameters (more complex) will have more variance explained
  • It will even “explain” non-existing relations (over-fitting)

Three types of solutions to choose the better predicting model

  1. Testing how well the model predicts a test data set
  2. Impose a penalty on the number of parameters
  3. Use a statistical test that shows whether a model starts fitting the error

The bias-variance trade-off

Complex models (many parameters) tend to have large variance

When repeatedly fitting to different training sets, the fitted function varies a lot: small changes in the training set lead to large changes in the fitted function (parameter values).


Simple models (few parameters) tend to have large bias

Models that are too simple deviate systematically from the true relation: they cannot fit it.

The bias-variance trade-off graphically


The mean square error of a validation data set is the sum of

  1. Variance of the fitted equation (increases with model size)
  2. Squared bias of the fitted equation (decreases with model size)
  3. The variance of the true measurement error (independent of model size)

Model selection: using test data

Model selection using a test data set

Criterion: the best model is the one that minimizes the MSE of a test data set


[Figure panels: the test data set (red); MSE on the test data set]

Model selection using cross-validation

If the data set is large enough:

  • Divide the data randomly into \(k\) (typically 5–10) equal parts
  • Use every part once as a test set, with the remaining \(k-1\) parts together as a training set
  • This yields \(k\) almost independent validation results of the accuracy of the model
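A minimal cross-validation sketch with \(k=5\) folds (data and model are illustrative):

```r
# Minimal k-fold cross-validation for a linear model
set.seed(1)
d <- data.frame(x = runif(100, 0, 10))
d$y <- 2 * d$x + rnorm(100)

k <- 5
fold <- sample(rep(1:k, length.out = nrow(d)))    # random fold assignment
cv_mse <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = d[fold != i, ])         # train on k-1 folds
  mean((d$y[fold == i] - predict(fit, d[fold == i, ]))^2)
})
mean(cv_mse)                                      # cross-validated MSE
```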

Variant: Leave-one-out cross-validation

Leave-one-out (LOO) cross-validation

  • For small data sets
  • Leave each data point out of the data set once, and predict its value using the model fitted to the remaining data
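A minimal LOO sketch with simulated data:

```r
# Leave-one-out CV: each point is predicted from a fit to all other points
set.seed(1)
d <- data.frame(x = runif(30, 0, 10))
d$y <- 2 * d$x + rnorm(30)

loo_err <- sapply(seq_len(nrow(d)), function(i) {
  fit <- lm(y ~ x, data = d[-i, ])                # fit without point i
  d$y[i] - predict(fit, d[i, , drop = FALSE])     # its prediction error
})
mean(loo_err^2)   # LOO estimate of the prediction MSE
```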

Model selection: complexity penalties

Measures of goodness of fit with a complexity penalty

  1. Adjusted \(R^2\)
  2. Bayes and Akaike Information Criteria (BIC, AIC)

Adjusted \(R^2\)

\[\text{Adjusted }R^2 = 1 - \frac{\text{RSS}/(n-p)}{\text{TSS}/(n-1)}\]

where \(n\) is the number of measurements and \(p\) the number of parameters.
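The formula can be checked against `summary.lm()`, which reports the same quantity as `adj.r.squared` (simulated data):

```r
# Adjusted R^2 computed by hand vs the value reported by summary()
set.seed(2)
n <- 25
x <- rnorm(n)
y <- 1 + x + rnorm(n)
fit <- lm(y ~ x)
p <- length(coef(fit))

RSS <- sum(residuals(fit)^2)
TSS <- sum((y - mean(y))^2)
adjR2 <- 1 - (RSS / (n - p)) / (TSS / (n - 1))

all.equal(adjR2, summary(fit)$adj.r.squared)  # TRUE
```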


Adjusted \(R^2\) of the synthetic models

Bayesian Information Criterion

\[\text{BIC} = p \ln(n) - 2 \ln(\hat{L})\]

where \(\hat{L}\) is the likelihood of the data (see Chapter 19) at the optimal parameter estimates of the model. When the error is normally distributed, this is approximately equal to

\[\text{BIC} \approx p \ln(n) + n \ln\left( \text{MSE} \right )\]

A lower BIC is an indication of a better model.
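The approximation can be checked in R. Note that `stats::BIC()` also counts the error variance as a parameter and includes model-independent constants, so absolute values differ, but differences between models fitted to the same data agree exactly (simulated data):

```r
# BIC approximation p*ln(n) + n*ln(MSE) vs stats::BIC(): differences agree
set.seed(3)
x <- 0:30
y <- 10 + x - 0.02 * x^2 + rnorm(length(x), 0, 4)
n <- length(y)

bic_approx <- function(fit) {
  p <- length(coef(fit))
  p * log(n) + n * log(mean(residuals(fit)^2))
}

M1 <- lm(y ~ x)
M2 <- lm(y ~ x + I(x^2))
c(approx_diff = bic_approx(M2) - bic_approx(M1),
  exact_diff  = BIC(M2) - BIC(M1))   # the two differences coincide
```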



BIC of the synthetic models

Summary of different model selection methods


Model \(R^2\) MSE test set Adj. \(R^2\) BIC
M0 0 1040 0 113.7
M1 0.9362 86.1 0.929 85.84
M2 0.9883 21.6 0.985 69.57
M3 0.9906 25.5 0.987 69.61
M4 0.991 28.1 0.985 71.49

Model selection: using statistical tests

Simplifying assumption about the error \(\epsilon\)

Under the condition that the model describes the true relation, and:

  • The error \(\epsilon_i\) is sampled from a normal distribution \(\mathcal{N}(0,\sigma^2)\)
  • \(\sigma\) is independent of the values of \(\vec{x}\)

If that assumption is true

  • We can perform common statistical tests (ANOVA, for example)
  • Confidence intervals for parameters can be easily calculated

If the assumption is not true but:

  • The error is sampled from another member of the Exponential Family of error distributions (Poisson, Binomial, Exponential, …)
    • We can use the generalized linear modeling (GLM) framework
  • The error is not sampled from such a distribution
    • If the data set is large enough we can resort to resampling techniques to perform statistical tests or derive confidence intervals
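As an illustration of the last point, a minimal case-resampling bootstrap for a slope confidence interval, using simulated data with deliberately non-normal errors:

```r
# Bootstrap (resampling) sketch: 95% percentile CI for a regression slope
set.seed(7)
x <- runif(50, 0, 10)
y <- 2 * x + rexp(50) - 1           # skewed, non-normal error
B <- 2000
slopes <- replicate(B, {
  i <- sample(50, replace = TRUE)   # resample cases with replacement
  coef(lm(y[i] ~ x[i]))[2]          # refit and keep the slope
})
quantile(slopes, c(0.025, 0.975))   # percentile 95% CI for the slope
```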

ANOVA on linear model fits

ANOVA (ANalysis Of VAriance) offers a statistical test of whether one model provides a better fit than another model.

  • ANOVA uses the “scaled” residual sum of squares of both models: \(\frac{\text{RSS}}{d}\), where \(d\) is the number of degrees of freedom of the fit.
  • The more complex model 2, having additional parameters, does not provide a better fit than model 1 when

\[F = \frac{(\text{RSS}_1 - \text{RSS}_2)/(d_1 -d_2)}{\text{RSS}_2/d_2} \approx 1\]

Whether the ratio \(F\) is significantly larger than 1 is tested by the F-test.
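The \(F\) statistic above can be computed by hand and checked against `anova()` (illustrative data):

```r
# F statistic for comparing two nested models, by hand and via anova()
set.seed(4)
x <- 1:20
y <- 1 + 0.5 * x + 0.05 * x^2 + rnorm(20)
M1 <- lm(y ~ x)
M2 <- lm(y ~ x + I(x^2))

rss1 <- sum(residuals(M1)^2); d1 <- M1$df.residual
rss2 <- sum(residuals(M2)^2); d2 <- M2$df.residual

Fstat <- ((rss1 - rss2) / (d1 - d2)) / (rss2 / d2)
all.equal(Fstat, anova(M1, M2)$F[2])  # TRUE
```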

Intuitive explanation of the F-test

\[F = \frac{\text{slope red line}}{\text{slope yellow line}}\]

If \(F\) is (significantly) larger than 1, the additional term and parameter in Model 2 contribute more to the reduction in RSS than expected by chance under the null hypothesis.


What is the null hypothesis?

ANOVA concerns the comparison of RSS of different models

  • ANOVA: analysis of variance, i.e. variance explained and variance left over
  • In an ANOVA we always compare RSS of two models

Always implicit comparison to M0 when testing a single model!


Explicitly comparing models M0 and M1

anova(M0, M1)
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ x
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     10 12861.7                                  
2      9   820.5  1     12041 132.08 1.111e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Implicitly comparing models M0 and M1

anova(M1)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)    
x          1 12041.2 12041.2  132.08 1.111e-06 ***
Residuals  9   820.5    91.2                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing models 1 and 2 using ANOVA

We can compare any pair of models

anova(M1, M2)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x + I(x^2)
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
1      9 820.47                                 
2      8 150.25  1    670.21 35.684 0.000333 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Demonstrates that the reduction in \(\text{RSS}\) by adding term \(x^2\) is significant.

Comparing models 2 and 3 using ANOVA

anova(M2, M3)
Analysis of Variance Table

Model 1: y ~ x + I(x^2)
Model 2: y ~ x + I(x^2) + I(x^3)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      8 150.25                           
2      7 121.27  1    28.987 1.6732 0.2369

Demonstrates that the reduction in \(\text{RSS}\) by adding term \(x^3\) is not significant.

Convenient use of anova(): comparing all models

anova(M1, M2, M3, M4)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x + I(x^2)
Model 3: y ~ x + I(x^2) + I(x^3)
Model 4: y ~ x + I(x^2) + I(x^3) + I(x^4)
  Res.Df    RSS Df Sum of Sq       F   Pr(>F)   
1      9 820.47                                 
2      8 150.25  1    670.21 34.7658 0.001057 **
3      7 121.27  1     28.99  1.5036 0.266052   
4      6 115.67  1      5.60  0.2904 0.609357   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using ANOVA as a criterion to select the best model we would choose M2.

Comparing a pair of models with ANOVA is only valid if:

  1. Both models are fitted to the same data set
  2. Models are linear
  3. Models are nested, i.e. the simpler model equals the more complex model with one or more parameters equal to 0


How do you make parameters equal to 0 in lm()?
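By omitting the corresponding terms from the formula: a parameter that does not appear in the model formula is implicitly fixed at 0. A minimal sketch:

```r
# In lm(), a parameter is "set to 0" by leaving its term out of the formula:
# the reduced model below is the full model with beta_2 fixed at 0
set.seed(5)
x <- rnorm(50)
y <- 1 + x + rnorm(50)
full    <- lm(y ~ x + I(x^2))   # beta_0 + beta_1 x + beta_2 x^2
reduced <- lm(y ~ x)            # same model with beta_2 = 0
anova(reduced, full)            # valid: reduced is nested in full
```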

Example of selection of linear models using ANOVA

Body weight

Mixture of categorical and continuous predictors

age weight height gender
21 65.6 174.0 M
23 71.8 175.3 M
28 80.7 193.5 M
23 72.6 186.5 M
22 78.8 187.2 M
21 74.8 181.5 M

Response weight: \(w\)


Predictors:

  • gender \(g\) (categorical)
  • age \(a\), height \(h\) (numeric)

Predicting weight using height and gender

Different models, different number of parameters

M1: single line

\[w = \beta_0 + \beta_2 \, h\]

M2: different intercepts

\[w = \underbrace{\beta_0 + \beta_1 \, g}_{\text{intercept}} + \beta_2 \, h\]

M3: different intercepts, different slopes

\[w = \underbrace{\beta_0 + \beta_1 \, g}_{\text{intercept}} + \underbrace{(\beta_2 + \beta_3 \, g)}_{\text{slope}} h\]

R formulas:

weight ~ height



weight ~ gender + height




weight ~ gender * height

Which of these models are nested?

We can always add the simplest model: the null model

M0: the average weight of all people

\[ w = \beta_0 \]

R formula:

weight ~ 1

This simply predicts the weight to be the average weight, and does not use predictors

The \(\text{RSS}\) of this model is by definition equal to the \(\text{TSS}\) and equals \(9.01\times 10^{4}\). Its \(\text{MSE}\) equals \(178\).
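A sketch of fitting the four models M0–M3. The course's body-weight data set is not included here, so the data frame below is simulated with a plausible structure (all coefficients in the simulation are assumptions).

```r
# Fitting the four weight models on simulated stand-in data
set.seed(6)
n <- 200
gender <- factor(sample(c("F", "M"), n, replace = TRUE))
height <- rnorm(n, mean = ifelse(gender == "M", 180, 167), sd = 7)
weight <- -70 + 0.8 * height + 5 * (gender == "M") + rnorm(n, 0, 6)
bodydata <- data.frame(weight, height, gender)

M0 <- lm(weight ~ 1,               data = bodydata)  # null model
M1 <- lm(weight ~ height,          data = bodydata)  # single line
M2 <- lm(weight ~ gender + height, data = bodydata)  # different intercepts
M3 <- lm(weight ~ gender * height, data = bodydata)  # intercepts and slopes
anova(M0, M1, M2, M3)
```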

Some terminology

  • Some people call ANOVA on models with mixtures of categorical and numeric predictors ANCOVA (analysis of covariance)

Model M3 has the formula \(w = \beta_0 + \beta_1 \, g + \beta_2 \, h + \beta_3 \, g h\)

  • We call \(\beta_3 \, g h\) an interaction term. In general, interaction terms are terms in which different variables are multiplied

Comparing the models using ANOVA

anova(M0,M1,M2,M3)
Analysis of Variance Table

Model 1: weight ~ 1
Model 2: weight ~ height
Model 3: weight ~ gender + height
Model 4: weight ~ gender * height
  Res.Df   RSS Df Sum of Sq        F    Pr(>F)    
1    506 90123                                    
2    505 43753  1     46370 599.4181 < 2.2e-16 ***
3    504 39043  1      4710  60.8802 3.534e-14 ***
4    503 38912  1       132   1.7043    0.1923    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model M2, with different intercepts but the same slope, is the best among our models according to this criterion.

Graphical display of model M2

Questions

Which other model might you want to test?

  • What is its mathematical formula?
  • What is its R-formula?
  • To which other models could/should you compare it using ANOVA?


Would you want to test data transformations?

  • Which variable?
  • Which transformation?
  • For what reason?

Answers

Other model

\[w = \beta_0 + \beta_2 \, h + \beta_3 \, g h\]

Let's call this model M2'.

weight ~ height + height:gender

Compare it to M1, which is nested in M2', and to M3, in which M2' itself is nested.

This model is one with a common intercept for males and females but with different slopes
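A sketch of fitting M2' and running the two nested comparisons; the data below are simulated stand-ins for the course's body-weight data set.

```r
# M2': common intercept, gender-specific slopes (simulated data)
set.seed(8)
n <- 200
gender <- factor(sample(c("F", "M"), n, replace = TRUE))
height <- rnorm(n, mean = ifelse(gender == "M", 180, 167), sd = 7)
weight <- -70 + 0.8 * height + 5 * (gender == "M") + rnorm(n, 0, 6)

M1  <- lm(weight ~ height)
M2p <- lm(weight ~ height + height:gender)   # M2': one intercept, two slopes
M3  <- lm(weight ~ gender * height)
anova(M1, M2p)   # is a gender-specific slope needed on top of M1?
anova(M2p, M3)   # is a separate intercept needed on top of M2'?
```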


Data transformations

Perhaps a log-transformation of weight. The spread of the data above the fitted line seems larger than below it: the residuals are probably not normally distributed.